Mapping Reddit

Reddit is a discussion board that bills itself as the "Front Page of the Internet". It is divided into a large number of topic-specific "subreddits". In this demo, we'll use data about which subreddits active Reddit users contribute to in order to build a visual map of subreddits. The data comes from the paper Navigating the massive world of reddit.

Initializing pymldb

In this demo, we will use pymldb to interact with the REST API: see the Using pymldb Tutorial for more details.


In [1]:
from pymldb import Connection
mldb = Connection("http://localhost")
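
If you want to confirm the connection before going further, any GET against the REST API will do; for example (a quick sketch, assuming the standard /v1/types/procedures route), this lists the procedure types the server knows about:

# Sanity check: list the available procedure types (assumes the /v1/types route)
mldb.get('/v1/types/procedures')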

Loading up the raw data


In [2]:
mldb.put('/v1/procedures/import_reddit', { 
    "type": "import.text",  
    "params": { 
        "dataFileUrl": "http://public.mldb.ai/reddit.csv.gz",
        "delimiter": "",   # empty delimiter: don't split the lines...
        "quoteChar": "",   # ...and don't treat any character as a quote,
                           # so each line lands whole in a single column
        "outputDataset": "reddit_raw",
        "runOnCreation": True
    } 
})


Out[2]:
PUT http://localhost/v1/procedures/import_reddit
201 Created
{
  "status": {
    "firstRun": {
      "runStarted": "2016-05-28T03:43:43.374825Z", 
      "status": {
        "numLineErrors": 0
      }, 
      "runFinished": "2016-05-28T03:43:45.0417228Z", 
      "id": "2016-05-28T03:43:43.374753Z-5bc7042b732cb41f", 
      "state": "finished"
    }
  }, 
  "config": {
    "params": {
      "quoteChar": "", 
      "delimiter": "", 
      "outputDataset": "reddit_raw", 
      "runOnCreation": true, 
      "dataFileUrl": "http://public.mldb.ai/reddit.csv.gz"
    }, 
    "type": "import.text", 
    "id": "import_reddit"
  }, 
  "state": "ok", 
  "type": "import.text", 
  "id": "import_reddit"
}

And here is what our raw dataset looks like. The lineText column will need to be parsed: it's comma-delimited, with the first token being a user ID and the remaining tokens being the set of subreddits that user contributed to.


In [3]:
mldb.query("select * from reddit_raw limit 5")


Out[3]:
lineText
_rowName
1 603,politics,trees,pics
2 604,Metal,AskReddit,tattoos,redditguild,WTF,co...
3 605,politics,IAmA,AdviceAnimals,movies,smallbu...
4 606,CrohnsDisease,birthcontrol,IAmA,AdviceAnim...
5 607,space,Fitment,cars,Economics,Libertarian,2...

Transforming the raw data into a sparse matrix

We will create and run a Procedure of type transform. The tokenize function will project out the subreddit names into columns.
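
Before creating the procedure, here's a quick sketch of what tokenize does to a single line, using a literal taken from the sample output above; offset: 1 skips the first token (the user ID) and value: 1 assigns the value 1 to each remaining token:

# Preview tokenize on one sample line: the user ID is skipped and each
# subreddit becomes a column with value 1
mldb.query("select tokenize('603,politics,trees,pics', {offset: 1, value: 1}) as *")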


In [4]:
mldb.put('/v1/procedures/reddit_import', {
    "type": "transform",
    "params": {
        "inputData": "select tokenize(lineText, {offset: 1, value: 1}) as * from reddit_raw",
        "outputDataset": "reddit_dataset",
        "runOnCreation": True
    }
})


Out[4]:
PUT http://localhost/v1/procedures/reddit_import
201 Created
{
  "status": {
    "firstRun": {
      "runStarted": "2016-05-28T03:43:45.3741944Z", 
      "status": {
        "columnCount": 15122, 
        "rowCount": 876961, 
        "valueCount": 892067
      }, 
      "runFinished": "2016-05-28T03:43:51.7200754Z", 
      "id": "2016-05-28T03:43:45.374114Z-5bc7042b732cb41f", 
      "state": "finished"
    }
  }, 
  "config": {
    "params": {
      "outputDataset": "reddit_dataset", 
      "inputData": "select tokenize(lineText, {offset: 1, value: 1}) as * from reddit_raw", 
      "runOnCreation": true
    }, 
    "type": "transform", 
    "id": "reddit_import"
  }, 
  "state": "ok", 
  "type": "transform", 
  "id": "reddit_import"
}

Here is the resulting dataset: it's a sparse matrix with a row per user and a column per subreddit, where the cells are 1 if the row's user was a contributor to the column's subreddit, and null otherwise.


In [5]:
mldb.query("select * from reddit_dataset limit 5")


Out[5]:
AdviceAnimals AskReddit LucidDreaming Mustang WTF cars ems funny guns motorcycles ... fffffffuuuuuuuuuuuu gaming movies "reddit.com" trees videos breakingbad cringe darksouls thatHappened
_rowName
785489 1 1 1 1 1 1 1 1 1 1 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
754094 NaN 1 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
785459 1 1 NaN NaN 1 1 NaN 1 NaN NaN ... 1 1 1 1 1 1 NaN NaN NaN NaN
785298 NaN 1 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
785140 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN 1 1 1 1

5 rows × 32 columns
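
As a quick sanity check (a sketch using a row name taken from the sample above), we can pull back a single user's row; only the columns that are non-null for that row come back:

# Fetch one user's row; the result contains only that user's subreddits
mldb.query("select * from reddit_dataset where rowName() = '785489'")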

Dimensionality Reduction with Singular Value Decomposition (SVD)

We will create and run a Procedure of type svd.train.


In [6]:
mldb.put('/v1/procedures/reddit_svd', {
    "type" : "svd.train",
    "params" : {
        "trainingData" : """
            SELECT 
                COLUMN EXPR (AS columnName() ORDER BY rowCount() DESC, columnName() LIMIT 4000) 
            FROM reddit_dataset
        """,
        "columnOutputDataset" : "reddit_svd_embedding",
        "runOnCreation": True
    }
})


Out[6]:
PUT http://localhost/v1/procedures/reddit_svd
201 Created
{
  "status": {
    "firstRun": {
      "runStarted": "2016-05-28T03:43:54.4468389Z", 
      "runFinished": "2016-05-28T03:44:44.2194996Z", 
      "id": "2016-05-28T03:43:54.446762Z-5bc7042b732cb41f", 
      "state": "finished"
    }
  }, 
  "config": {
    "params": {
      "columnOutputDataset": "reddit_svd_embedding", 
      "trainingData": "\n            SELECT \n                COLUMN EXPR (AS columnName() ORDER BY rowCount() DESC, columnName() LIMIT 4000) \n            FROM reddit_dataset\n        ", 
      "runOnCreation": true
    }, 
    "type": "svd.train", 
    "id": "reddit_svd"
  }, 
  "state": "ok", 
  "type": "svd.train", 
  "id": "reddit_svd"
}
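
A note on the COLUMN EXPR clause above: it keeps only the 4000 columns (subreddits) with the highest rowCount(), i.e. those with the most contributors, breaking ties by column name. You can run the same clause as a plain query to preview which columns survive the cut (a sketch, with smaller limits to keep the output readable):

# Preview the column selection: the 10 subreddits with the most contributors
mldb.query("""
    select COLUMN EXPR (AS columnName() ORDER BY rowCount() DESC, columnName() LIMIT 10)
    from reddit_dataset
    limit 3
""")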

The result of this operation is a new dataset with one row for each of the 4000 most-active subreddits; its columns are the coordinates of that subreddit in a 100-dimensional space.

Note: the row names are the subreddit names followed by ".numberEquals.1" because the SVD training procedure interpreted the input matrix as categorical rather than numerical.
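
These suffixed names propagate to every downstream dataset, so it helps to know that rowPathElement(0) recovers the bare subreddit name (the final visualization query relies on this). A minimal check:

# Each embedding row is named <subreddit>.numberEquals.1; take the first
# path element to recover the bare subreddit name
mldb.query("select rowPathElement(0) as subreddit from reddit_svd_embedding limit 5")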


In [8]:
mldb.query("select * from reddit_svd_embedding limit 5")


Out[8]:
embedding.0 embedding.10 embedding.11 embedding.12 embedding.13 embedding.14 embedding.15 embedding.16 embedding.17 embedding.18 ... embedding.91 embedding.92 embedding.93 embedding.94 embedding.95 embedding.96 embedding.97 embedding.98 embedding.99 embedding.9
_rowName
AskReddit.numberEquals.1 -0.448854 0.002584 0.071048 0.004983 -0.004218 -0.000756 -0.014381 0.039270 0.011690 0.005458 ... -0.001824 -0.001005 0.002710 0.001701 0.002573 -0.002802 -0.004249 0.006107 -0.000439 0.066779
funny.numberEquals.1 -0.400773 -0.096315 -0.002975 0.027832 0.006565 -0.009726 0.020593 0.019796 -0.006885 -0.020714 ... 0.000278 0.000375 0.001001 -0.000875 0.000552 -0.001406 0.001127 -0.002564 -0.002319 0.311920
pics.numberEquals.1 -0.380045 0.041943 -0.038168 -0.073641 0.054020 0.070904 0.024925 -0.028070 0.015659 0.013717 ... 0.004381 -0.003049 -0.001057 -0.003187 -0.000768 -0.001563 0.000897 -0.001195 -0.002240 -0.468059
WTF.numberEquals.1 -0.294057 0.131549 -0.170047 -0.081603 0.048413 -0.017429 -0.013951 -0.024105 -0.052710 0.028045 ... 0.000223 0.003586 0.003579 -0.000195 0.004218 -0.001208 0.001949 0.003942 0.003031 0.005529
gaming.numberEquals.1 -0.263956 -0.123462 -0.109681 0.100280 0.050675 -0.071520 -0.129815 0.166366 0.046277 -0.155068 ... 0.005391 -0.005024 0.011013 -0.003873 0.002043 0.009336 -0.001159 0.001016 0.002622 -0.062798

5 rows × 100 columns

Clustering with K-Means

We will create and run a Procedure of type kmeans.train.


In [9]:
mldb.put('/v1/procedures/reddit_kmeans', {
    "type" : "kmeans.train",
    "params" : {
        "trainingData" : "select * from reddit_svd_embedding",
        "outputDataset" : "reddit_kmeans_clusters",
        "numClusters" : 20,
        "runOnCreation": True
    }
})


Out[9]:
PUT http://localhost/v1/procedures/reddit_kmeans
201 Created
{
  "status": {
    "firstRun": {
      "runStarted": "2016-05-28T03:44:58.9105241Z", 
      "runFinished": "2016-05-28T03:44:59.093301Z", 
      "id": "2016-05-28T03:44:58.910438Z-5bc7042b732cb41f", 
      "state": "finished"
    }
  }, 
  "config": {
    "params": {
      "numClusters": 20, 
      "outputDataset": "reddit_kmeans_clusters", 
      "trainingData": "select * from reddit_svd_embedding", 
      "runOnCreation": true
    }, 
    "type": "kmeans.train", 
    "id": "reddit_kmeans"
  }, 
  "state": "ok", 
  "type": "kmeans.train", 
  "id": "reddit_kmeans"
}

The result of this operation is a simple dataset that assigns each row of the input (i.e. each subreddit) to one of 20 clusters.


In [10]:
mldb.query("select * from reddit_kmeans_clusters limit 5")


Out[10]:
cluster
_rowName
usyd.numberEquals.1 5
Mommit.numberEquals.1 16
minnesotatwins.numberEquals.1 3
HongKong.numberEquals.1 2
niggers.numberEquals.1 18
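
To get a feel for the clustering (a quick sketch, not part of the original flow), we can count how many subreddits landed in each cluster:

# Cluster sizes: number of subreddits assigned to each of the 20 clusters
mldb.query("""
    select cluster, count(*) as n
    from reddit_kmeans_clusters
    group by cluster
    order by count(*) desc
""")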

2-d Dimensionality Reduction with t-SNE

We will create and run a Procedure of type tsne.train.


In [11]:
mldb.put('/v1/procedures/reddit_tsne', {
    "type" : "tsne.train",
    "params" : {
        "trainingData" : "select * from reddit_svd_embedding",
        "rowOutputDataset" : "reddit_tsne_embedding",
        "runOnCreation": True
    }
})


Out[11]:
PUT http://localhost/v1/procedures/reddit_tsne
201 Created
{
  "status": {
    "firstRun": {
      "runStarted": "2016-05-28T03:45:01.2912471Z", 
      "runFinished": "2016-05-28T03:45:12.5964549Z", 
      "id": "2016-05-28T03:45:01.291181Z-5bc7042b732cb41f", 
      "state": "finished"
    }
  }, 
  "config": {
    "params": {
      "trainingData": "select * from reddit_svd_embedding", 
      "rowOutputDataset": "reddit_tsne_embedding", 
      "runOnCreation": true
    }, 
    "type": "tsne.train", 
    "id": "reddit_tsne"
  }, 
  "state": "ok", 
  "type": "tsne.train", 
  "id": "reddit_tsne"
}

The result is similar to the SVD step above: we get a row per subreddit and the columns are coordinates, but this time in a 2-dimensional space appropriate for visualization.


In [12]:
mldb.query("select * from reddit_tsne_embedding limit 5")


Out[12]:
x y
_rowName
usyd.numberEquals.1 -30.012499 4.485752
Mommit.numberEquals.1 8.447520 -47.081745
minnesotatwins.numberEquals.1 -50.094040 -18.705278
HongKong.numberEquals.1 -34.990097 1.323213
niggers.numberEquals.1 26.637928 -0.875668
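
Before plotting, a quick sketch to check the extent of the 2-d layout (useful context for the label grid chosen later):

# Bounding box of the t-SNE layout
mldb.query("""
    select min(x) as min_x, max(x) as max_x, min(y) as min_y, max(y) as max_y
    from reddit_tsne_embedding
""")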

Counting the number of users per subreddit

We will create and run a Procedure of type transform on the transpose of the original input dataset: in the transposed dataset there is one row per subreddit and one column per user, so columnCount() gives the number of users who contributed to each subreddit.


In [13]:
mldb.put('/v1/procedures/reddit_count_users', {
    "type": "transform",
    "params": {
        "inputData": "select columnCount() as numUsers from transpose(reddit_dataset)",
        "outputDataset": "reddit_user_counts",
        "runOnCreation": True
    }
})


Out[13]:
PUT http://localhost/v1/procedures/reddit_count_users
201 Created
{
  "status": {
    "firstRun": {
      "runStarted": "2016-05-28T03:45:20.0345097Z", 
      "status": {
        "columnCount": 1, 
        "rowCount": 15122, 
        "valueCount": 15123
      }, 
      "runFinished": "2016-05-28T03:45:21.165891Z", 
      "id": "2016-05-28T03:45:20.034423Z-5bc7042b732cb41f", 
      "state": "finished"
    }
  }, 
  "config": {
    "params": {
      "outputDataset": "reddit_user_counts", 
      "inputData": "select columnCount() as numUsers from transpose(reddit_dataset)", 
      "runOnCreation": true
    }, 
    "type": "transform", 
    "id": "reddit_count_users"
  }, 
  "state": "ok", 
  "type": "transform", 
  "id": "reddit_count_users"
}

Note that the row names in this dataset are the plain subreddit names, without the ".numberEquals.1" suffix that the SVD output carries; the final query below accounts for this by joining on rowPathElement(0).


In [14]:
mldb.query("select * from reddit_user_counts limit 5")


Out[14]:
numUsers
_rowName
oracle 27
notcollectingstamps 1
notarealsub 1
norge 446
police 89
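
As a sanity check (a sketch using a subreddit from the sample above), the per-subreddit count should match a direct column sum over the original sparse matrix, since each cell is either 1 or null; for norge this should come back as the 446 shown above:

# Column sum over the sparse matrix = number of contributing users
mldb.query('select sum("norge") as numUsers from reddit_dataset')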

Querying and Visualizing the output

We'll use the Query API to get the data into a Pandas DataFrame and then use Bokeh to visualize it.

In the query below, we name each output row with c.rowName() to recover the plain subreddit name, dropping the ".numberEquals.1" suffix that the SVD appended, and we filter out rows where cluster is null, because we only clustered the 4000 most-active subreddits. We also compute grid_x and grid_y with quantize, which snaps each coordinate to the nearest multiple of 7; the plotting code below uses these grid cells to place at most one text label per cell.


In [17]:
df = mldb.query("""
    select c.* as *, m.* as *, quantize(m.x, 7) as grid_x, quantize(m.y, 7) as grid_y 
    named c.rowName() 
    from merge(reddit_tsne_embedding, reddit_kmeans_clusters) as m
        join reddit_user_counts as c on c.rowName() = m.rowPathElement(0)
    where m.cluster is not null 
    order by c.numUsers desc
""")
df.head()


Out[17]:
cluster grid_x grid_y numUsers x y
_rowName
AskReddit 2 0 -42 523005 -1.118738 -43.699772
funny 3 -49 0 396478 -45.672779 -1.286311
pics 0 -35 -42 362588 -33.671326 -43.146740
WTF 16 21 -49 262293 21.992300 -49.596294
gaming 13 -14 49 255763 -12.789426 48.143551

In [18]:
import numpy as np

# 20 visually distinct colors (the d3 Category20 palette), one per k-means cluster
colormap = np.array([
    "#1f77b4", "#aec7e8", "#ff7f0e", "#ffbb78", "#2ca02c", 
    "#98df8a", "#d62728", "#ff9896", "#9467bd", "#c5b0d5", 
    "#8c564b", "#c49c94", "#e377c2", "#f7b6d2", "#7f7f7f", 
    "#c7c7c7", "#bcbd22", "#dbdb8d", "#17becf", "#9edae5"
])

import bokeh.plotting as bp
from bokeh.models import HoverTool

In [19]:
# this line must be in its own cell, so that BokehJS is loaded before we plot
bp.output_notebook()


Loading BokehJS ...

In [20]:
fig = bp.figure(plot_width=900, plot_height=700, title="Subreddit Map by t-SNE",
       tools=[HoverTool(tooltips=[("/r/", "@subreddit")])], toolbar_location=None,
       x_axis_type=None, y_axis_type=None, min_border=1)

# One circle per subreddit: position from t-SNE, color from its k-means
# cluster, radius scaled sub-linearly with its number of users
fig.scatter(
    x=df.x.values,
    y=df.y.values,
    color=colormap[df.cluster.astype(int).values],
    alpha=0.6,
    radius=(df.numUsers.values ** .3) / 15,
    source=bp.ColumnDataSource({"subreddit": df.index.values})  # feeds the hover tooltip
)

# Label at most one subreddit per grid cell: df is sorted by numUsers
# descending, so first() keeps the most popular subreddit in each cell
labels = df.reset_index().groupby(['grid_x', 'grid_y'], as_index=False).first()
labels = labels[labels["numUsers"] > 10000]
fig.text(
    x=labels.x.values,
    y=labels.y.values,
    text=labels["_rowName"].values,
    text_align="center", text_baseline="middle",
    text_font_size="8pt", text_font_style="bold",
    text_color="#333333"
)

bp.show(fig)


Out[20]:

<Bokeh Notebook handle for In[20]>

Where to next?

Check out the other Tutorials and Demos.

